Abstract
Biomedical Named Entity Recognition (BioNER) plays a vital role in biomedical text mining by extracting meaningful entities such as diseases, genes, and chemicals from unstructured textual sources. Despite significant advancements, challenges such as noisy data, domain-specific terminology, and limited generalization remain. This paper presents a simplified dual-transformer model combining PubMedBERT and RoBERTa to enhance entity recognition in biomedical literature. PubMedBERT captures domain-specific features, while RoBERTa contributes general linguistic context. The outputs of both models are fused using attention-based concatenation and decoded with a Conditional Random Field (CRF) layer to ensure consistent entity labeling. Noise-aware data augmentation techniques are incorporated to improve robustness against misspellings and variations. The model is evaluated on two benchmark datasets, NCBI Disease and BC2GM, and achieves a macro F1-score of 90.06% on the test set and 90.98% on the validation set, demonstrating reliable recognition of multi-token biomedical entities and domain-specific abbreviations. These results validate the effectiveness of combining domain-specific and general-purpose transformers in a lightweight framework suitable for real-world biomedical applications.
Introduction
This study presents a lightweight yet effective BioNER model that automatically identifies biomedical entities (e.g., diseases, drugs, genes) from unstructured text. It uses a dual-encoder architecture that combines PubMedBERT (domain-specific) and RoBERTa (general-purpose) with a Conditional Random Field (CRF) layer for structured sequence labeling.
Background & Motivation:
Traditional rule-based and statistical models (like CRFs and SVMs) are limited by lack of contextual understanding and reliance on handcrafted features.
Transformer-based models (e.g., BioBERT, PubMedBERT) have significantly improved performance in BioNER, but they are computationally heavy and require large annotated datasets.
There is a need for interpretable, efficient, and accurate BioNER models that perform well even in resource-constrained environments.
Proposed Model:
Uses PubMedBERT for biomedical-specific semantics and RoBERTa for general linguistic context.
Outputs from both are fused via a linear transformation and passed to a CRF layer to capture sequence-level dependencies.
Employs hybrid loss (cross-entropy + CRF loss) and data augmentation (e.g., noise injection) to boost robustness.
Handles long sequences using overlapping sliding windows and maintains label alignment during tokenization.
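The overlapping-window strategy for long sequences can be sketched in plain Python as follows. The window and stride sizes here are illustrative; the paper does not specify its exact values:

```python
# Sketch of overlapping sliding-window chunking for long sequences.
# Window and stride values are illustrative placeholders, not the
# paper's actual configuration.

def sliding_windows(tokens, window=128, stride=64):
    """Split a token list into overlapping chunks.

    Returns (chunk, start_offset) pairs. The overlap region lets
    predictions near chunk boundaries be reconciled when the windows
    are merged back into one sequence.
    """
    if len(tokens) <= window:
        return [(tokens, 0)]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append((tokens[start:start + window], start))
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```

Keeping the start offset of each chunk is what makes it possible to map window-level predictions back to positions in the original sequence.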
Dataset & Preprocessing:
Uses the NCBI-Disease dataset (annotated PubMed abstracts), following the IOBES labeling scheme.
Maintains label-token alignment during subword tokenization and avoids complex preprocessing.
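The label-token alignment step can be sketched as follows, assuming the common convention of keeping the label on a word's first subword and masking the rest with -100 so the loss ignores them. The toy tokenizer in the test is a stand-in for the real PubMedBERT/RoBERTa tokenizers:

```python
# Sketch of word-to-subword label alignment for IOBES tagging.
# Convention assumed (common, but not stated in the paper): the first
# subword of a word keeps the word's label; remaining subwords get
# IGNORE (-100) so the loss function skips them.

IGNORE = -100

def align_labels(words, word_labels, tokenize):
    """Expand word-level labels onto subword tokens."""
    subwords, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        subwords.extend(pieces)
        # First subword keeps the label; the rest are masked out.
        labels.extend([label] + [IGNORE] * (len(pieces) - 1))
    return subwords, labels
```

This keeps one predicted label per original word, which is what the IOBES scheme and entity-level evaluation expect.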
Training Strategy:
Utilizes the Adam optimizer with learning rate scheduling and early stopping.
Trained over multiple epochs with batch sizes tuned for different settings.
Validates the model using both classification metrics and clustering-based analyses.
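The early-stopping criterion mentioned above can be sketched as follows; the patience and min_delta values are illustrative, since the paper does not report its exact settings:

```python
# Minimal early-stopping sketch: stop training once validation loss has
# not improved for `patience` consecutive epochs. The default values
# are illustrative, not the paper's reported configuration.

class EarlyStopping:
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Tracking the best loss rather than the previous one prevents a slow oscillation around a plateau from resetting the counter.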
Results:
The dual-encoder model (PubMedBERT + RoBERTa + CRF) achieved:
Accuracy: 98.53%
Precision: 0.89
Recall: 0.90
F1 Score: 0.90
Outperformed baseline models such as:
PubMedBERT (F1: 0.87)
BioBERT (F1: 0.86)
RoBERTa (F1: 0.84)
Training/validation curves show stable convergence without overfitting.
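Entity-level precision, recall, and F1 of the kind reported above are conventionally computed over exact span matches rather than per-token labels. A minimal sketch of span extraction from IOBES tags (the tag sequences in the test are illustrative, not taken from the paper's datasets):

```python
# Sketch: extract entity spans from an IOBES tag sequence and score
# predictions by exact span match. Tag data used for testing is
# illustrative, not from NCBI-Disease or BC2GM.

def iobes_spans(tags):
    """Return the set of (start, end, type) entity spans, end exclusive."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                     # single-token entity
            spans.add((i, i + 1, label))
            start, etype = None, None
        elif prefix == "B":                   # open a multi-token entity
            start, etype = i, label
        elif prefix == "E" and start is not None and label == etype:
            spans.add((start, i + 1, etype))  # close the open entity
            start, etype = None, None
        elif prefix == "I" and start is not None and label == etype:
            continue                          # continue the open entity
        else:                                 # "O" or inconsistent tag
            start, etype = None, None
    return spans

def span_f1(gold, pred):
    """Exact-match precision, recall, and F1 over extracted spans."""
    g, p = iobes_spans(gold), iobes_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Scoring on whole spans is why boundary errors on multi-token entities hurt F1 even when most individual tokens are labeled correctly.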
Conclusion
This study proposes a robust and efficient dual-encoder framework for Biomedical Named Entity Recognition (BioNER), utilizing PubMedBERT and RoBERTa encoders along with a CRF decoding layer. The integration of a domain-specific encoder (PubMedBERT) with a general-purpose language model (RoBERTa) allows the system to effectively capture both biomedical terminology and contextual syntax. Through a fusion mechanism, the model combines token-level representations and projects them into a unified embedding space that is optimized for sequence labeling. The CRF layer enforces label consistency and improves boundary detection, especially for multi-token disease mentions. Experimental evaluation on the NCBI-Disease dataset shows that the proposed model consistently outperforms individual baselines in terms of F1-score, precision, and recall, thereby validating its applicability for real-world biomedical text mining.
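The label-consistency role of the CRF decoder can be illustrated with a minimal Viterbi sketch in plain Python. The tag set, emission scores, and transition scores below are hypothetical, not taken from the trained model, whose emissions come from the fused encoder outputs:

```python
# Minimal Viterbi decoding sketch for a CRF-style tagger.
# All scores below are hypothetical; in the paper's model, emissions
# come from the fused PubMedBERT + RoBERTa representations and the
# transition scores are learned by the CRF layer.

def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions:   one {tag: score} dict per token
    transitions: {(prev_tag, tag): score}; missing pairs are forbidden
    tags:        list of tag names
    """
    scores = {t: emissions[0][t] for t in tags}
    backpointers = []
    for emit in emissions[1:]:
        new_scores, bp = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: scores[p] + transitions.get((p, t), -1e9))
            new_scores[t] = scores[best_prev] + transitions.get((best_prev, t), -1e9) + emit[t]
            bp[t] = best_prev
        backpointers.append(bp)
        scores = new_scores
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: scores[t])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return path[::-1]

# Toy 3-token example: greedy per-token argmax would label the middle
# token "O" (score 2.0 > 1.5), but the transition scores make the
# consistent path B -> I -> O score higher overall.
TAGS = ["B", "I", "O"]
TRANSITIONS = {("B", "I"): 1.0, ("B", "O"): -2.0, ("I", "I"): 1.0,
               ("I", "O"): 1.0, ("O", "O"): 1.0, ("O", "B"): 1.0}
EMISSIONS = [{"B": 2.0, "I": 0.0, "O": 1.0},
             {"B": 0.0, "I": 1.5, "O": 2.0},
             {"B": 0.0, "I": 0.0, "O": 2.0}]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)  # -> ["B", "I", "O"]
```

This is the mechanism behind the boundary improvements noted above: globally scored paths override locally plausible but inconsistent per-token choices.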
Despite achieving high performance, certain limitations remain that present directions for future research. While the model handles boundary-sensitive and multi-token entities well, challenges such as ambiguous abbreviations and rare terminology still require further enhancement. Future work may explore the incorporation of cross-lingual transformers to support multilingual biomedical texts, enabling broader applicability in international clinical datasets. Moreover, integrating domain-adaptive pretraining or curriculum learning may help the model generalize to unseen biomedical subdomains. To improve user trust and transparency, interpretability features—such as token-level attention heatmaps or decision rationales—could be incorporated into the prediction interface, allowing domain experts to verify and validate the model’s outputs more confidently in critical medical applications.